NEARSIDE: Structured kNowledge Extraction frAmework from SpecIes DEscriptions

Abstract

Species descriptions are stored in textual form in corpora such as floras and faunas, but this large amount of information cannot be used directly by algorithms, nor can it be linked to other data sources. The production of knowledge bases expressing structured species descriptions can benefit from collaborative, easy-to-use platforms like Xper3 (Vignes-Lebbe et al. 2017, Kerner and Vignes-Lebbe 2019, Saucède et al. 2021), but it is very time-consuming at the human level. It is therefore mandatory for this task to make the information contained in species descriptions measurable and compatible with computer techniques. One of the most used structures on the web and by the deep learning community is the triplet structure. Each piece of information is represented as a set of 3 elements (subject, predicate, object). One of the first steps towards accessibility is developing a text-to-triplet model, also known as text-to-graph, from monograph descriptions. In this work, we developed NEARSIDE, a text-to-graph model adapted to biology, to create normalized morphological characteristic descriptions. In Natural Language Processing, deep learning models have proven effective for extracting information in the open domain (Lample et al. 2016, Sutskever et al. 2014), especially since the emergence of attention-based models (Devlin et al. 2019b, Devlin et al. 2019a). Several works have been made in the biomedical domain (Fries et al. 2017, Cho and Lee 2019). In our case, we propose a model adapted to floras. Fully supervised models require annotated data for training; nevertheless, the annotation process implies an expensive human intervention. Distant supervision is a technique that can reduce the annotation cost. This paradigm uses a small glossary to project classes at the word level to new, complex and longer text (see Fig. 1). Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that consists of classifying words of interest (Sutskever et al. 2014, Lample et al. 2016), while triplet extraction can be compared to the Relation Extraction (RE) task, which consists of classifying semantic relations between pairs of words. Distantly supervised NER is studied more often in the literature than distantly supervised RE (Liang et al. 2020, Meng et al. 2021), simply because its subtask of generating distant annotations is less complex (see Fig. 2).

Our first contribution is creating a species description dataset with a well-balanced test set that allows us to bypass several biases induced and observed on several datasets (Taillé et al. 2021). In this dataset, each word will be classified into one of 15 classes, each class being a specific kind of organ or descriptor. Our second contribution is proposing a model trained on fauna and flora descriptions that are particularly long and use technical vocabulary. We develop a context-oriented pretraining of the language model. Thus the encoder provides contextualized vectors that can be extracted to measure similarities between different species. The model reaches 96% accuracy in named entity classification on the test set. Our third contribution is the construction of a triplet module applied to the model's outputs. It is based on dependency rules inspired by Xper3’s representation format (see Fig. 3). Finally, NEARSIDE is an end-to-end framework to extract information from unstructured corpora, making it easily linked and measured.
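
As a rough illustration of the distant supervision idea in the abstract (a small glossary projecting classes onto words of a longer description), the following minimal sketch labels tokens and then groups them into (subject, predicate, object) triplets. Everything in it is an assumption made for illustration: the glossary entries, the class names, and the helper names `distant_labels` and `to_triplets` are hypothetical, and the toy "attach each descriptor to the most recent organ" rule stands in for the dependency rules of NEARSIDE, which the abstract does not specify.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mini-glossary (term -> class). The actual glossary and the
# 15 classes used by the authors are not listed in the abstract.
GLOSSARY: Dict[str, str] = {
    "leaf": "ORGAN",
    "leaves": "ORGAN",
    "petal": "ORGAN",
    "petals": "ORGAN",
    "ovate": "SHAPE",
    "glabrous": "SURFACE",
    "yellow": "COLOR",
}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [tok.strip(".,;:()").lower() for tok in text.split()]
    return [tok for tok in tokens if tok]


def distant_labels(text: str) -> List[Tuple[str, str]]:
    """Project glossary classes onto tokens; 'O' marks out-of-glossary words."""
    return [(tok, GLOSSARY.get(tok, "O")) for tok in tokenize(text)]


def to_triplets(labeled: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Toy rule (an assumption, not the paper's dependency rules):
    attach each descriptor to the most recently seen organ."""
    triplets: List[Tuple[str, str, str]] = []
    current_organ: Optional[str] = None
    for tok, label in labeled:
        if label == "ORGAN":
            current_organ = tok
        elif label != "O" and current_organ is not None:
            # (subject, predicate, object) = (organ, descriptor class, value)
            triplets.append((current_organ, label.lower(), tok))
    return triplets


sentence = "Leaves ovate, glabrous; petals yellow."
labeled = distant_labels(sentence)
triplets = to_triplets(labeled)
print(labeled)
print(triplets)
```

In a distantly supervised setup, labels produced this way would serve only as noisy training annotations for a learned tagger; the glossary lookup itself is not the model.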

Similar resources

Knowledge Extraction from Structured Sources

This chapter surveys knowledge extraction approaches from structured sources such as relational databases, XML and CSV. A general definition of knowledge extraction is devised that covers structured as well as unstructured sources. We summarize current progress on conversion of structured data to RDF and OWL. As an example, we provide a formalization and description of SparqlMap, which implemen...

A framework for structured knowledge extraction and representation from natural language via deep sentence analysis

We present a framework that we are currently developing, that allows one to extract knowledge from natural language sentences using a deep analysis technique based on linguistic dependencies. The extracted knowledge is represented in OOLOT, an intermediate format that we have introduced, inspired by the Language of Thought (LOT) and based on Answer Set Programming (ASP). OOLOT uses an ontologyo...

Knowledge Extraction from Semi-structured Data Based on Fuzzy Techniques

In this work we propose a fuzzy technique to compare XML documents belonging to a semi-structured flow and sharing a common vocabulary of tags. Our approach is based on the idea of representing documents as fuzzy bags and, using a measure of comparison, evaluating structural similarities between them. Then we suggest how to organize the extracted knowledge in a class hierarchy, choosing a techn...

The Toxicity Material Extraction From Euphorbia Species

Euphorbia is a genus of flowering plants belonging to the family Euphorbiaceae, consisting of 2008 species. The genus Euphorbia produces an irritant, which constitutes a health hazard to humans and livestock. The genus Euphorbia is one of the largest and most complex genera of flowering plants; however, several botanists have made unsuccessful attempts to subdivide it into smaller genera. Many Eu...

Extracting Knowledge from Biological Descriptions

We describe a system which performs biological identification on the basis of natural language descriptions. The system parses texts containing large sets of biological descriptions in restricted natural language and constructs a knowledge base. The system can semi-automatically adapt to a text by extending its lexicon and perhaps its grammar. The constructed knowledge bases are used to perform...

Journal

Journal title: Biodiversity Information Science and Standards

Year: 2022

ISSN: 2535-0897

DOI: https://doi.org/10.3897/biss.6.94297